Adapting Conventional Chinese Word Segmenter for Segmenting Micro-blog Text: Combining Rule-based and Statistic-based Approaches
نویسندگان
چکیده
We describe two adaptation strategies which are used in our word segmentation system in participating the Microblog word segmentation bake-off: Domain invariant information is extracted from the in-domain unlabelled corpus, and is incorporated as supplementary features to conventional word segmenter based on Conditional Random Field (CRF), we call it statistic-based adaptation. Some heuristic rules are further used to post-process the word segmentation result in order to better handle the characters in emoticons, name entities and special punctuation patterns which extensively exist in micro-blog text, and we call it rule-based adaptation. Experimentally, using both adaptation strategies, our system achieved 92.46 points of F-score, compared with 88.73 points of F-score of the unadapted CRF word segmenter on the pre-released development data. Our system achieved 92.51 points of F-score on the final test data.
منابع مشابه
CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data
In this paper, we proposed a Chinese word segmentation model for micro-blog text. Although Conditional Random Fields (CRFs) models have been presented to deal with word segmentation, this is still the first time to apply it for the segmentation in the domain of Chinese micro-blog. Different from the genres of common articles, micro-blog has gradually become a new literary with the development o...
متن کاملChinese Word Segmentation for Terrorism-Related Contents
In order to analyze security and terrorism related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistic-based and dictionary-based approaches. The pure statistic methods have lower precision, while the pure dictionary-based method cannot deal with new words and ...
متن کاملWord Segmenter for Chinese Micro-blogging Text Segmentation - Report for CIPS-SIGHAN'2014 Bakeoff
This paper presents our system for the CIPSSIGHAN-2014 bakeoff task of Chinese word segmentation. This system adopts a characterbased joint approach, which combines a character-based generative model and a character-based discriminative model. To further improve the performance in cross-domain, an external dictionary is employed. In addition, pre-processing and post-processing rules are utilize...
متن کاملA Feature-Rich CRF Segmenter for Chinese Micro-Blog
This paper describes our system for Chinese word segmentation of micro-blog text, one of the NLPCC-ICCPOL 2016 Shared Tasks [1]. The CRF (Conditional Random Field) model is employed to model word segmentation as a sequence labeling problem, 7 sets of features are selected to train the CRF model. The system achieves fb 0.798144 on closed track, 0.81968 on semi-open track, and 0.82217 on open tra...
متن کاملChinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach
This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in rea...
متن کامل